1. Introduction

Entity resolution (ER) is the task of identifying records belonging to the same entity
(e.g. individual, group) across one or multiple databases. Ironically, it has multiple names:
deduplication and record linkage, among others. In this paper we survey metrics used to
evaluate ER results in order to iteratively improve performance and guarantee sufficient
quality prior to deployment. Some of these metrics are borrowed from the multi-class
classification and clustering domains, though key differences separate entity
resolution from general clustering. Menestrina et al. showed empirically that the rankings
produced by these metrics often conflict with each other, which is our primary motivation
for studying them [1]. This paper provides practitioners with the basic knowledge to begin
evaluating their entity resolution results.


2. Problem Statement 

Our notation follows that of [1]. Consider an input set of records I = {a, b, c, d, e} where
a, b, c, d, and e are unique records. Let R = {⟨a, b, d⟩, ⟨c, e⟩} denote an entity resolution
clustering output, where ⟨...⟩ denotes a cluster. Let S be the true clustering, referred to
as the “gold standard.” The goal of any entity resolution metric is to measure the error (or
similarity) of R compared to the gold standard S.

3. Pairwise Metrics 

Pairwise metrics consider every pair of records as samples for evaluating performance.
Let Pairs(R) denote all the intra-cluster pairs in the clustering R. In our example,
Pairs(R) = {(a, b), (a, d), (b, d), (c, e)}. Confusingly, some studies count as pairs only those
where a direct match was made, excluding matches made through transitive relations [2]. For
example, [2] would exclude (a, d) if the matches leading to R were a ≈ b, b ≈ d, and c ≈ e,
where ≈ denotes a match. We choose the former definition because it is independent of
the underlying matching process - it only depends on the final entity resolution results.

Unlike many machine learning classification tasks, we never consider non-matches (i.e.
inter-cluster pairs) in entity resolution metrics [3]. In conventional clustering tasks, the
number of clusters is constant or sub-linear with respect to the number of records n [4].
However, the number of clusters is O(n) in conventional ER tasks. So although the number
of intra-cluster pairs (e.g. true positives) is O(n), the number of inter-cluster pairs (e.g.
true negatives) is O(n^2). To illustrate, consider our original example with 5 records and
2 clusters. There are 4 intra-cluster pairs and 6 inter-cluster pairs. Now, compare this to
a larger database with 50 records and 20 clusters, all of comparable size to the original
example. There will be approximately 40 intra-cluster pairs but well over 1,000 inter-cluster
pairs (50 records form 1,225 pairs in total). Thus, metrics using inter-cluster pairs (e.g.
False Positive Rate) are dominated by the quadratically growing number of true negatives:
they improve as the database grows and provide overly optimistic results for large databases.
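
The pair counts in this comparison can be checked with a short script. This is only an illustrative sketch: the helper name count_pairs and the particular 50-record clustering (ten clusters of three records and ten of two) are assumptions chosen to resemble the example, not data from any cited study.

```python
def count_pairs(clusters):
    """Return (intra-cluster, inter-cluster) pair counts for disjoint clusters."""
    n = sum(len(c) for c in clusters)
    # A cluster of size k contributes k * (k - 1) / 2 intra-cluster pairs.
    intra = sum(len(c) * (len(c) - 1) // 2 for c in clusters)
    total = n * (n - 1) // 2
    return intra, total - intra


# The running example: R = {<a, b, d>, <c, e>}.
small = [{"a", "b", "d"}, {"c", "e"}]

# Hypothetical larger database: 50 records in 20 clusters of size 3 or 2.
large = [set(range(3 * i, 3 * i + 3)) for i in range(10)]
large += [set(range(30 + 2 * i, 32 + 2 * i)) for i in range(10)]

print(count_pairs(small))  # (4, 6)
print(count_pairs(large))  # (40, 1185)
```

The intra-cluster count grows tenfold with the database, while the inter-cluster count grows by roughly a factor of 200.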


3.1. Pairwise Precision, Recall, and F1. Using Pairs as the samples, the pairwise
precision and recall metric functions follow conventional machine learning definitions. The
harmonic mean of these metrics leads to the most frequently used entity resolution metric,
pairwise F1. All these metrics are bounded in [0, 1].


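(1)  $$\mathrm{PairPrecision}(R, S) = \frac{|\mathrm{Pairs}(R) \cap \mathrm{Pairs}(S)|}{|\mathrm{Pairs}(R)|}$$

(2)  $$\mathrm{PairRecall}(R, S) = \frac{|\mathrm{Pairs}(R) \cap \mathrm{Pairs}(S)|}{|\mathrm{Pairs}(S)|}$$

(3)  $$\mathrm{Pair}F_1(R, S) = \frac{2 \cdot \mathrm{PairPrecision}(R, S) \cdot \mathrm{PairRecall}(R, S)}{\mathrm{PairPrecision}(R, S) + \mathrm{PairRecall}(R, S)}$$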

The benefit of pairwise metrics is their intuitive interpretation. Pairwise precision is 
the percentage of matches in the predicted clustering that are correct. Pairwise recall is 
the percentage of matches in the true clustering that are also in the predicted clustering. 
Unfortunately pairwise metrics may convey overly optimistic results, depending on the 
use case. For example, in many entity resolution tasks the end user only cares about 
the final entity - not the records it comprises. Mismatching two singleton entities has an 
insignificant impact on pairwise metrics compared to incorrectly joining or splitting two 
large clusters. 
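
A minimal sketch of the pairwise metrics follows, assuming clusterings are given as lists of Python sets. The function names and the gold standard S used here are hypothetical, and returning 1.0 when a clustering has no intra-cluster pairs is a convention we assume rather than one taken from the cited works.

```python
from itertools import combinations


def pairs(clustering):
    """All unordered intra-cluster pairs of a clustering."""
    out = set()
    for cluster in clustering:
        out.update(frozenset(p) for p in combinations(cluster, 2))
    return out


def pairwise_prf(R, S):
    """Pairwise precision, recall, and F1 of R against the gold standard S."""
    pairs_r, pairs_s = pairs(R), pairs(S)
    common = len(pairs_r & pairs_s)
    # Assumed convention: an empty pair set yields a score of 1.0.
    precision = common / len(pairs_r) if pairs_r else 1.0
    recall = common / len(pairs_s) if pairs_s else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"c", "d", "e"}]  # hypothetical gold standard
print(pairwise_prf(R, S))  # (0.5, 0.5, 0.5)
```

Here Pairs(R) and Pairs(S) each contain four pairs and share two of them, (a, b) and (c, e), so all three scores equal 0.5.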


4. Cluster Metrics 

Like the pairwise metrics, all the cluster metrics discussed here are bounded in [0, 1], a
convenient property when comparing across datasets and for setting quality standards.

4.1. Cluster Precision, Recall, and F1. Cluster level metrics attempt to capture a more
holistic understanding of the final entities. At the extreme opposite of pairwise metrics,
cluster level precision [5] and recall [6] consider exact cluster matches. Mathematically,
cluster precision and recall are defined as $|R \cap S| / |R|$ and $|R \cap S| / |S|$, respectively,
where R ∩ S is the set of clusters appearing in both clusterings. Now, mismatching
two singleton entities will have the same impact as mismatching two larger clusters. Obviously,
this metric has the opposite drawback - even one corrupted match in a cluster will
cause the entire cluster to mismatch due to the use of exact comparisons. Thus, this metric
is rarely used; its successor, closest cluster precision, recall, and F1, is preferred.
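
The exact-match behavior can be sketched as follows, taking cluster precision and recall to be the fraction of clusters in R (respectively S) that appear unchanged in the other clustering. The function name and the gold standard S are hypothetical.

```python
def cluster_precision_recall(R, S):
    """Exact-match cluster precision and recall of R against S."""
    r_clusters = {frozenset(c) for c in R}
    s_clusters = {frozenset(c) for c in S}
    exact = len(r_clusters & s_clusters)  # clusters identical in R and S
    return exact / len(r_clusters), exact / len(s_clusters)


R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"d"}, {"c", "e"}]  # hypothetical gold standard
precision, recall = cluster_precision_recall(R, S)
print(precision, recall)
```

Only the cluster containing c and e matches exactly, so precision is 1/2 and recall is 1/3, even though the first cluster of R has just one misplaced record.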

4.2. Closest Cluster Precision, Recall, and F1. Closest cluster metrics correct for the
previous cluster-level drawbacks by incorporating a notion of cluster similarity [7]. Using
the Jaccard similarity coefficient $J(r, s) = \frac{|r \cap s|}{|r \cup s|}$ to capture cluster similarity, the precision
and recall can be expressed as

(4)  $$\mathrm{ccPrecision}(R, S) = \frac{\sum_{r \in R} \max_{s \in S} J(r, s)}{|R|}$$

(5)  $$\mathrm{ccRecall}(R, S) = \frac{\sum_{s \in S} \max_{r \in R} J(r, s)}{|S|}$$

where r and s are clusters in R and S, respectively. This metric, and many of the ones
following, attempt to balance the tradeoffs of the pairwise and exact cluster metrics.
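
As a rough illustration, the closest cluster metrics can be computed directly from the Jaccard coefficient. The function names and the gold standard S are hypothetical, and the F1 here is assumed to be the usual harmonic mean of the two scores.

```python
def jaccard(r, s):
    """Jaccard similarity of two non-empty clusters."""
    return len(r & s) / len(r | s)


def closest_cluster_prf(R, S):
    """Closest cluster precision, recall, and F1 of R against S."""
    precision = sum(max(jaccard(r, s) for s in S) for r in R) / len(R)
    recall = sum(max(jaccard(r, s) for r in R) for s in S) / len(S)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"d"}, {"c", "e"}]  # hypothetical gold standard
cc_precision, cc_recall, cc_f1 = closest_cluster_prf(R, S)
print(cc_precision, cc_recall, cc_f1)
```

In this example the nearly correct cluster of a, b, and d earns partial credit (its best Jaccard score is 2/3), giving a precision of 5/6 and a recall of 2/3 where exact cluster matching would give 1/2 and 1/3.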

4.3. Purity and K. Cluster purity was first proposed in 1998 [8] and later extended to
Average Cluster Purity (ACP) and Average Author Purity (AAP) (archaically referred to
as Average Speaker Purity) [9]. The ACP and AAP are defined as

(6)  $$ACP = \frac{1}{N} \sum_{r \in R} \sum_{s \in S} \frac{|r \cap s|^2}{|r|}$$

(7)  $$AAP = \frac{1}{N} \sum_{r \in R} \sum_{s \in S} \frac{|r \cap s|^2}{|s|}$$

where N is the total number of records. Then the K measure is defined as the geometric
mean of these values, $K = \sqrt{AAP \cdot ACP}$.
In many applications only a single purity metric is evaluated, usually something comparable
to ACP. For example, [10] considers the dominant class in each cluster by defining purity
as $p = \frac{1}{N} \sum_{r \in R} \max_{s \in S} |r \cap s|$. The use of this single metric is misleading and only shows
one half of the precision/recall coin. As an extreme example, setting |R| = N (i.e. each
record in its own cluster) would achieve a perfect p = 1.0, yet is clearly far from ideal.
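
The degenerate all-singletons case can be reproduced with a small sketch. It assumes ACP and AAP sum |r ∩ s|^2 over all cluster pairs, normalized by |r| and |s| respectively and divided by N; the function name and the gold standard S are hypothetical.

```python
from math import sqrt


def purity_metrics(R, S):
    """Return (ACP, AAP, K, purity) of R against the gold standard S."""
    n = sum(len(r) for r in R)  # total number of records N
    acp = sum(len(r & s) ** 2 / len(r) for r in R for s in S) / n
    aap = sum(len(r & s) ** 2 / len(s) for r in R for s in S) / n
    purity = sum(max(len(r & s) for s in S) for r in R) / n
    return acp, aap, sqrt(acp * aap), purity


S = [{"a", "b", "d"}, {"c", "e"}]   # hypothetical gold standard
singletons = [{x} for x in "abcde"]  # every record in its own cluster
acp, aap, k, p = purity_metrics(singletons, S)
print(acp, aap, k, p)
```

Both ACP and the single purity p reach 1.0 for the singleton clustering, while AAP drops to 0.4 and pulls K down to about 0.63, which is why reporting both halves matters.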










A PRACTIONER’S GUIDE TO EVALUATING ENTITY RESOLUTION RESULTS 


4.4. Homogeneity, Completeness, and V-Measure. Homogeneity and completeness 
are entropy based metrics, somewhat analogous to precision and recall, respectively [11]. 
A cluster in R has perfect homogeneity if all records belong to the same cluster in S. 
Conversely, a cluster in S has perfect completeness if all its records belong to the same 
cluster in R. Entropy H and its conditional variation are defined as 


(8)  $$H(S) = -\frac{1}{N} \sum_{s \in S} \sum_{r \in R} |r \cap s| \log \frac{\sum_{r' \in R} |r' \cap s|}{N}$$

(9)  $$H(S|R) = -\frac{1}{N} \sum_{r \in R} \sum_{s \in S} |r \cap s| \log \frac{|r \cap s|}{\sum_{s' \in S} |r \cap s'|}$$

where N is the total number of records. Using these entropies, homogeneity and completeness
are defined as:


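(10)  $$\mathrm{Homogeneity}(R, S) = \begin{cases} 1 & \text{if } H(S) = 0 \\ 1 - \frac{H(S|R)}{H(S)} & \text{else} \end{cases}$$

(11)  $$\mathrm{Completeness}(R, S) = \begin{cases} 1 & \text{if } H(R) = 0 \\ 1 - \frac{H(R|S)}{H(R)} & \text{else} \end{cases}$$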

V-Measure is defined analogously to the F1 metric as the harmonic mean of homogeneity
and completeness.

(12)  $$V_\beta(R, S) = \frac{(1 + \beta^2) \cdot \mathrm{Homogeneity}(R, S) \cdot \mathrm{Completeness}(R, S)}{\beta^2 \cdot \mathrm{Homogeneity}(R, S) + \mathrm{Completeness}(R, S)}$$

where β is a user defined parameter, usually set to β = 1 as in the F1 metric. Completeness
is weighted more heavily if β > 1 and homogeneity is weighted more heavily if
β < 1. Some sources use β instead of β^2 weighting; we chose the latter due to its popularity.
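
The entropy-based metrics can be sketched in a few lines, assuming R and S partition the same N records (so the overlaps of a cluster sum to its size) and taking 0 log 0 as 0. The function names and the example clusterings are hypothetical.

```python
from math import log


def entropy(A, n):
    """H(A) for a partition A of n records."""
    return -sum(len(a) / n * log(len(a) / n) for a in A)


def conditional_entropy(A, B, n):
    """H(A|B) for two partitions of the same n records."""
    total = 0.0
    for b in B:
        for a in A:
            overlap = len(a & b)
            if overlap:  # empty overlaps contribute nothing (0 log 0 = 0)
                total -= overlap / n * log(overlap / len(b))
    return total


def v_measure(R, S, beta=1.0):
    """Return (homogeneity, completeness, V-measure) of R against S."""
    n = sum(len(r) for r in R)
    h_s, h_r = entropy(S, n), entropy(R, n)
    hom = 1.0 if h_s == 0 else 1.0 - conditional_entropy(S, R, n) / h_s
    com = 1.0 if h_r == 0 else 1.0 - conditional_entropy(R, S, n) / h_r
    denom = beta ** 2 * hom + com
    v = (1 + beta ** 2) * hom * com / denom if denom else 0.0
    return hom, com, v


S = [{"a", "b", "d"}, {"c", "e"}]   # hypothetical gold standard
singletons = [{x} for x in "abcde"]
hom, com, v = v_measure(singletons, S)
print(hom, com, v)
```

For the all-singletons clustering, homogeneity is perfect because every predicted cluster lies inside one true cluster, but completeness falls to roughly 0.42, so the V-measure penalizes the over-splitting.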


4.5. Other Metrics. The natural language processing community uses several other entity
resolution metrics, which are rarely used in machine learning and database applications
[12]. We refer the reader to MUC-6 [13], B^3 F1 [14], and CEAF [15].


5. Edit Distance Metrics 

Edit distance metrics can be thought of similarly to string edit distance functions. They 
are a measure of the information lost and gained while modifying R to S. Unfortunately, 
they do not have the convenient [0,1] bound and are thus difficult to relate to any notion 
of a ‘good’ score. 
5.1. Variation of Information. VI [16] can conveniently be expressed with the previous
conditional entropy metric [11].

(13)  $$VI(R, S) = H(S|R) + H(R|S)$$

An important property of VI is that it does not directly depend on N, only on the sizes of
the clusters. Thus, it is acceptable to add records from new clusters to a database while
continuously measuring VI performance.
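
A minimal sketch of VI as the sum of the two conditional entropies, assuming both clusterings partition the same records and using natural logarithms. The function names and example clusterings are hypothetical.

```python
from math import log


def conditional_entropy(A, B):
    """H(A|B) for two partitions A and B of the same records."""
    n = sum(len(b) for b in B)
    total = 0.0
    for b in B:
        for a in A:
            overlap = len(a & b)
            if overlap:  # skip empty overlaps (0 log 0 = 0)
                total -= overlap / n * log(overlap / len(b))
    return total


def variation_of_information(R, S):
    """VI(R, S) = H(S|R) + H(R|S); 0 when the clusterings are identical."""
    return conditional_entropy(S, R) + conditional_entropy(R, S)


R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"c", "d", "e"}]  # hypothetical gold standard
print(variation_of_information(R, S))
print(variation_of_information(S, S))
```

Unlike the metrics of the previous sections, lower is better here, and the value has no fixed upper bound.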

5.2. Generalized Merge Distance. Generalized Merge Distance (GMD) is perhaps the
most comprehensive metric in the sense that it can be used to directly calculate several other
metrics [1]. GMD(R, S) is the minimum legal path cost of converting R to S, where the
costs of splitting and merging sets of records are user-defined operation-order-independent
functions. Many such functions exist, such as f(x, y) = k, f(x, y) = kxy, and f(x, y) =
k_1 + k_2 xy, where x and y are the sizes of the record sets to split or merge and k, k_1, and
k_2 are constants. We refer the reader to [17] for background on operation-order-independent
functions.

Menestrina et al. not only show that GMD(R, S) can be computed in linear time, but
explicitly show how pairwise precision, recall, F1, and VI can be computed using specific
cost functions. Depending on the choice of cost functions, GMD is likely dependent on N
(the cost functions used in the VI formulation are one exception) and difficult to compare
across datasets of different sizes.
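
The idea can be illustrated with a simplified split-then-merge sketch. It assumes operation-order-independent cost functions, and it is not claimed to reproduce the exact procedure of [1]; the names gmd, f_split, and f_merge and the example clusterings are hypothetical. Each cluster of R is split into the pieces that belong to different clusters of S, and each piece is then merged into its growing S cluster.

```python
def gmd(R, S, f_split, f_merge):
    """Split-then-merge cost of converting clustering R into clustering S."""
    # Map each record to the index of its gold-standard cluster.
    s_index = {rec: i for i, s in enumerate(S) for rec in s}
    merged_size = [0] * len(S)  # records gathered so far per S cluster
    cost = 0
    for r in R:
        # Sizes of the pieces of r, grouped by gold-standard cluster.
        parts = {}
        for rec in r:
            parts[s_index[rec]] = parts.get(s_index[rec], 0) + 1
        remaining = len(r)
        for i, size in parts.items():
            if remaining > size:  # split the piece off the rest of r
                cost += f_split(size, remaining - size)
            remaining -= size
            if merged_size[i] > 0:  # merge the piece into its S cluster
                cost += f_merge(size, merged_size[i])
            merged_size[i] += size
    return cost


R = [{"a", "b", "d"}, {"c", "e"}]
S = [{"a", "b"}, {"c", "d", "e"}]  # hypothetical gold standard
split_cost = gmd(R, S, lambda x, y: x * y, lambda x, y: 0)
merge_cost = gmd(R, S, lambda x, y: 0, lambda x, y: x * y)
print(split_cost, merge_cost)  # 2 2
```

With a split cost of xy and a free merge, the total counts the pairs in Pairs(R) that are absent from Pairs(S), so pairwise precision is 1 - 2/4 = 0.5 in this example; swapping the two cost functions counts the missing pairs of Pairs(S) and gives a pairwise recall of 0.5. Each record is visited once, so the running time is linear in the number of records.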

5.3. Conclusion. Simple examples show that a promising pairwise metric may have poor
cluster-level performance [2]. More rigorous analysis shows this is not only possible, but
common across a range of applications [1]. At an absolute minimum, we recommend evaluating
with pairwise F1 because of its simplicity and popularity. We also recommend the
use of a cluster metric and Generalized Merge Distance - which could conveniently be
configured to calculate VI and the pairwise F1 in linear time.

All the metrics discussed herein rely on the availability of a “gold standard” S. In 
practice, human-labeled results rarely number beyond several thousand samples. On large 
datasets, a relative gold standard may be obtained by forgoing blocking efficiency and
running an exhaustive ER algorithm on the entire database [1]. We note, however, that 
doing so on databases larger than even 10,000 records is infeasible for some algorithms [7]. 
Further, an exhaustive approach is still only an approximation and carries no guarantees 
relative to the true clustering. A need exists for semi- and un-supervised evaluation metrics. 
Some metrics exist for a very specific subset of circumstances, but for the majority of 
applications the general research problem is still open [18]. 